'It Was Nuts': The Extreme Tests that Show Why Hail Is a Multibillion-Dollar Problem

WIRED

The costs of hail damage have ballooned over the past two decades, prompting researchers to resort to extreme measures to understand how these storms destroy buildings. The scars left on houses sometimes look like shotgun blasts. In the aftermath of major storms, Andrew Shick, owner and chief executive of the Illinois-based firm Roofing USA, has driven through suburbs blasted by hail and been left stunned by the damage. Earlier this year, he visited a farm complex in western Illinois where roofs, even sturdy metal ones, were left pockmarked and perforated after 3-inch balls of ice fell from the sky. "It was nuts," he recalls.


ReviewGuard: Enhancing Deficient Peer Review Detection via LLM-Driven Data Augmentation

Zhang, Haoxuan, Li, Ruochi, Shrestha, Sarthak, Mamidala, Shree Harshini, Putta, Revanth, Aggarwal, Arka Krishan, Xiao, Ting, Ding, Junhua, Chen, Haihua

arXiv.org Artificial Intelligence

Peer review serves as the gatekeeper of science, yet the surge in submissions and widespread adoption of large language models (LLMs) in scholarly evaluation present unprecedented challenges. While recent work has focused on using LLMs to improve review efficiency, unchecked deficient reviews from both human experts and AI systems threaten to systematically undermine academic integrity. To address this issue, we introduce ReviewGuard, an automated system for detecting and categorizing deficient reviews through a four-stage LLM-driven framework: data collection from ICLR and NeurIPS on OpenReview, GPT-4.1 annotation with human validation, synthetic data augmentation yielding 6,634 papers with 24,657 real and 46,438 synthetic reviews, and fine-tuning of encoder-based models and open-source LLMs. Feature analysis reveals that deficient reviews exhibit lower rating scores, higher self-reported confidence, reduced structural complexity, and more negative sentiment than sufficient reviews. AI-generated text detection shows dramatic increases in AI-authored reviews since ChatGPT's emergence. Mixed training with synthetic and real data substantially improves detection performance: for example, Qwen 3-8B achieves recall of 0.6653 and F1 of 0.7073, up from 0.5499 and 0.5606 respectively. This study presents the first LLM-driven system for detecting deficient peer reviews, providing evidence to inform AI governance in peer review. Code, prompts, and data are available at https://github.com/haoxuan-unt2024/ReviewGuard
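As a quick sanity check on the reported gains: F1 is the harmonic mean of precision and recall, so the precision implied by the abstract's two numbers can be recovered directly (a stdlib-only sketch; the recall and F1 values are from the abstract, the helper names are illustrative):

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def precision_from_f1(f1, recall):
    """Invert F1 = 2PR / (P + R) to recover the implied precision."""
    return f1 * recall / (2 * recall - f1)

# Reported Qwen 3-8B numbers: recall 0.6653, F1 0.7073
implied_p = precision_from_f1(0.7073, 0.6653)  # roughly 0.755
```

So the mixed-training model trades off toward a noticeably higher precision than recall.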



SwinTrack: A Simple and Strong Baseline for Transformer Tracking

Neural Information Processing Systems

Recently, Transformers have been widely explored in tracking and have shown state-of-the-art (SOTA) performance. However, existing efforts mainly focus on fusing and enhancing features generated by convolutional neural networks (CNNs); the potential of Transformers in representation learning remains under-explored.



Assertion-Aware Test Code Summarization with Large Language Models

Mollah, Anamul Haque, Aljohani, Ahmed, Do, Hyunsook

arXiv.org Artificial Intelligence

Unit tests often lack concise summaries that convey test intent, especially in auto-generated or poorly documented codebases. Large Language Models (LLMs) offer a promising solution, but their effectiveness depends heavily on how they are prompted. Unlike generic code summarization, test-code summarization poses distinct challenges because test methods validate expected behavior through assertions rather than implementing functionality. This paper presents a new benchmark of 91 real-world Java test cases paired with developer-written summaries and conducts a controlled ablation study to investigate how test-code-related components, such as the method under test (MUT), assertion messages, and assertion semantics, affect the performance of LLM-generated test summaries. We evaluate four code LLMs (Codex, Codestral, DeepSeek, and Qwen-Coder) across seven prompt configurations using n-gram metrics (BLEU, ROUGE-L, METEOR), semantic similarity (BERTScore), and LLM-based evaluation. Results show that prompting with assertion semantics improves summary quality by an average of 0.10 points (2.3%) over full MUT context (4.45 vs. 4.35) while requiring fewer input tokens. Codex and Qwen-Coder achieve the highest alignment with human-written summaries, while DeepSeek underperforms despite high lexical overlap. The replication package is publicly available at https://doi.org/10.5281/zenodo.17067550
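Of the metrics listed above, ROUGE-L is simple enough to sketch from scratch: it scores a candidate summary by the longest common subsequence of tokens it shares with the reference (a minimal stdlib illustration, not the official ROUGE implementation):

```python
def lcs_len(a, b):
    """Longest common subsequence length via dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ta in enumerate(a, 1):
        for j, tb in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if ta == tb else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def rouge_l_f(candidate, reference):
    """ROUGE-L F-measure (beta = 1) over whitespace tokens."""
    c, r = candidate.split(), reference.split()
    lcs = lcs_len(c, r)
    if lcs == 0:
        return 0.0
    prec, rec = lcs / len(c), lcs / len(r)
    return 2 * prec * rec / (prec + rec)
```

Because LCS only requires tokens in the same order, not adjacent, a generated summary can score well on ROUGE-L while still missing the test's intent, which is why the paper pairs it with BERTScore and LLM-based evaluation.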


Automated and Explainable Denial of Service Analysis for AI-Driven Intrusion Detection Systems

Yakubu, Paul Badu, Santana, Lesther, Rahouti, Mohamed, Xin, Yufeng, Chehri, Abdellah, Aledhari, Mohammed

arXiv.org Artificial Intelligence

With the increasing frequency and sophistication of Distributed Denial of Service (DDoS) attacks, it has become critical to develop more efficient and interpretable detection methods. Traditional detection systems often struggle with scalability and transparency, hindering real-time response and understanding of attack vectors. This paper presents an automated framework for detecting and interpreting DDoS attacks using machine learning (ML). The proposed method leverages the Tree-based Pipeline Optimization Tool (TPOT) to automate the selection and optimization of ML models and features, reducing the need for manual experimentation. SHapley Additive exPlanations (SHAP) is incorporated to enhance model interpretability, providing detailed insights into the contribution of individual features to the detection process. By combining TPOT's automated pipeline selection with SHAP interpretability, this approach improves the accuracy and transparency of DDoS detection. Experimental results demonstrate that key features such as mean backward packet length and minimum forward packet header length are critical in detecting DDoS attacks, offering a scalable and explainable cybersecurity solution.
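The SHAP attributions the paper leans on are rooted in Shapley values from game theory: a feature's contribution is its average marginal effect over all feature orderings. A brute-force sketch (stdlib only, tractable for a handful of features; the toy linear "detector" and its weights are illustrative, not the paper's model):

```python
from itertools import permutations

def shapley_values(f, x, baseline):
    """Exact Shapley values by averaging each feature's marginal
    contribution over every feature ordering (exponential cost, so
    only feasible for a few features; SHAP approximates this)."""
    n = len(x)
    phi = [0.0] * n
    perms = list(permutations(range(n)))
    for order in perms:
        z = list(baseline)
        prev = f(z)
        for i in order:
            z[i] = x[i]          # switch feature i from baseline to actual
            cur = f(z)
            phi[i] += cur - prev  # marginal contribution in this ordering
            prev = cur
    return [p / len(perms) for p in phi]

# Toy linear score over two flow features (weights are made up):
w = [0.8, 0.3]
f = lambda z: sum(wi * zi for wi, zi in zip(w, z))
phi = shapley_values(f, x=[2.0, 1.0], baseline=[0.0, 0.0])
# For a linear model, phi_i = w_i * (x_i - baseline_i), so phi == [1.6, 0.3]
```

The attributions also satisfy the efficiency property: they sum exactly to f(x) minus f(baseline), which is what makes SHAP's per-feature "contribution to the detection" reading well-defined.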


Secure Code Generation at Scale with Reflexion

Datta, Arup, Aljohani, Ahmed, Do, Hyunsook

arXiv.org Artificial Intelligence

Large language models (LLMs) are now widely used to draft and refactor code, but code that works is not necessarily secure. We evaluate secure code generation using the Instruct Prime benchmark, which eliminates compliance-required prompts and cue contamination, and evaluate five instruction-tuned code LLMs using a zero-shot baseline and a three-round reflexion prompting approach. Security is measured using the Insecure Code Detector (ICD), and results are reported via Repair, Regression, and NetGain metrics, broken down by programming language and CWE family. Python yields the highest secure rates; C and C# are the lowest, with Java, JS, PHP, and C++ in the middle. Reflexion prompting improves security for all models, raising average accuracy from the 70.74% zero-shot baseline. The trends in the Repair, Regression, and NetGain metrics show that applying one to two rounds produces most of the benefit. A replication package is available at https://doi.org/10.5281/zenodo.17065846. LLMs such as GitHub Copilot, Codex, and DeepSeekCoder have made LLM-assisted coding common. Early studies focused on functionality and correctness [1], [2]: can models produce code that compiles and passes tests? Yet LLMs learn from large codebases that also contain design flaws. Recent work shows that low-quality code [3], [4] and vulnerabilities [5] can carry over into generated code.
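The three-round reflexion loop described above can be sketched generically: generate code, run a security detector over it, and feed any findings back into the next generation attempt. This is a hypothetical skeleton; `reflexion_repair` and the toy generator/detector below are illustrative stand-ins, not the paper's ICD or models:

```python
def reflexion_repair(generate, detect, prompt, rounds=3):
    """Reflexion-style loop: regenerate with detector feedback until
    nothing is flagged or the round budget is exhausted."""
    code = generate(prompt, feedback=None)
    for _ in range(rounds):
        findings = detect(code)
        if not findings:
            break
        code = generate(prompt, feedback=findings)
    return code

# Toy stubs for illustration only: the detector flags a hard-coded
# credential; the generator "fixes" it once it sees the feedback.
def toy_generate(prompt, feedback):
    return 'pw = os.environ["PW"]' if feedback else 'pw = "hunter2"'

def toy_detect(code):
    return ["CWE-798: hard-coded credential"] if '"hunter2"' in code else []

fixed = reflexion_repair(toy_generate, toy_detect, "read a password")
```

The toy run converges after one feedback round, mirroring the paper's observation that one to two rounds produce most of the benefit.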


Cross-Lingual Stability and Bias in Instruction-Tuned Language Models for Humanitarian NLP

Nemkova, Poli, Adhikari, Amrit, Pearson, Matthew, Sadu, Vamsi Krishna, Albert, Mark V.

arXiv.org Artificial Intelligence

Humanitarian organizations face a critical choice: invest in costly commercial APIs or rely on free open-weight models for multilingual human rights monitoring. While commercial systems offer reliability, open-weight alternatives lack empirical validation -- especially for low-resource languages common in conflict zones. This paper presents the first systematic comparison of commercial and open-weight large language models (LLMs) for human-rights-violation detection across seven languages, quantifying the cost-reliability trade-off facing resource-constrained organizations. Across 78,000 multilingual inferences, we evaluate six models -- four instruction-aligned (Claude-Sonnet-4, DeepSeek-V3, Gemini-Flash-2.0, GPT-4.1-mini) and two open-weight (LLaMA-3-8B, Mistral-7B) -- using both standard classification metrics and new measures of cross-lingual reliability: Calibration Deviation (CD), Decision Bias (B), Language Robustness Score (LRS), and Language Stability Score (LSS). Results show that alignment, not scale, determines stability: aligned models maintain near-invariant accuracy and balanced calibration across typologically distant and low-resource languages (e.g., Lingala, Burmese), while open-weight models exhibit significant prompt-language sensitivity and calibration drift. These findings demonstrate that multilingual alignment enables language-agnostic reasoning and provide practical guidance for humanitarian organizations balancing budget constraints with reliability in multilingual deployment.
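To make the idea of cross-lingual stability concrete, one deliberately simplified proxy is to summarize a model's per-language accuracies by their mean and spread; note this is only an illustration, not the paper's CD, B, LRS, or LSS definitions, and the numbers below are made up:

```python
from statistics import mean, pstdev

def stability_summary(acc_by_lang):
    """Illustrative proxy for cross-lingual stability: mean accuracy
    across prompt languages and its population standard deviation
    (lower spread = more language-invariant behavior)."""
    vals = list(acc_by_lang.values())
    return {"mean_acc": mean(vals), "spread": pstdev(vals)}

# Hypothetical accuracies for an aligned vs. open-weight model
# across English, Lingala, and Burmese prompts (not from the paper):
aligned = stability_summary({"en": 0.91, "ln": 0.90, "my": 0.89})
open_wt = stability_summary({"en": 0.88, "ln": 0.62, "my": 0.70})
```

Under this proxy, the aligned model's small spread reflects the near-invariant accuracy the paper reports, while the open-weight model's large spread reflects its prompt-language sensitivity.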